Abstract

Speech repairs occur often in spontaneous spoken dialogues. The ability to detect and correct those repairs is necessary for any spoken language system. We present a framework to detect and correct speech repairs in which all relevant levels of information, i.e., acoustics, lexis, syntax and semantics, can be integrated. The basic idea is to reduce the search space for repairs as early as possible by cascading filters that involve more and more features. First, an acoustic module generates hypotheses about the existence of a repair. Second, a stochastic model suggests a correction for every hypothesis. Well-scored corrections are inserted as new paths in the word lattice. Finally, a lattice parser decides whether to accept the repair.

1 Introduction 

Spontaneous speech is disfluent. In contrast to read speech, the sentences are not perfectly planned before they are uttered. Speakers often modify their plans while they speak. This results in pauses, word repetitions or changes, word fragments and restarts. Current automatic speech understanding systems perform very well in small domains with restricted speech but have great difficulty dealing with such disfluencies. A system that copes with these self corrections (= repairs) must recognize the spoken words and identify the repair to get the intended meaning of an utterance. To characterize a repair, it is commonly segmented into the following four parts (cf. fig. 1):

• reparandum: the "wrong" part of the utterance

• interruption point (IP): marker at the end of the reparandum

• editing term: special phrases which indicate a repair, like "well", "I mean", or filled pauses such as "uhm", "uh"

• reparans: the correction of the reparandum



[Figure 1 shows the utterance fragments "on Thursday I cannot" and "I can meet uh after one", with the segments labeled Reparandum, Interruption point, Editing Term and Reparans.]

Figure 1: Example of a self repair

Only if the reparandum and the editing term are known can the utterance be analyzed in the right way. It remains an open question whether the two terms should be deleted before a semantic analysis, as is sometimes suggested in the literature^1. If both terms are marked, deleting the reparandum and editing term is a straightforward preprocessing step. In the Verbmobil^2 corpus, a corpus dealing with appointment scheduling and travel planning, nearly 21% of all turns contain at least one repair. As a consequence, a speech understanding system that cannot handle repairs will lose performance on these turns.

Even if repairs are defined by syntactic and semantic well-formedness (Levelt, 1983), we observe that most of them are local phenomena. At this point we have to differentiate between restarts and other repairs^3 (modification repairs). Modification repairs have a strong correspondence between reparandum and reparans, whereas restarts are less structured. We believe that there is no need for a complete syntactic analysis to detect and correct most modification repairs. Thus, in what follows, we will concentrate on this kind of repair.

^1 In most cases a reparandum could be deleted without any loss of information. But if, for example, it introduces an object which is referred to later, a deletion is not appropriate.

^2 This work is part of the VERBMOBIL project and was funded by the German Federal Ministry for Research and Technology (BMBF) in the framework of the Verbmobil Project under Grant BMBF 01 IV 701 VO. The responsibility for the contents of this study lies with the authors.

^3 Often a third kind of repair is defined: "abridged repairs". These repairs consist solely of an editing term and are not repairs in our sense.

There are two major arguments for processing repairs before parsing. First, spontaneous speech is not always syntactically well-formed, even in the absence of self corrections. Second, (meta-)rules increase the parser's search space. This is perhaps acceptable for transliterated speech, but not for speech recognizer output such as lattices, because they represent millions of possible spoken utterances. In addition, systems which are not based on a deep syntactic and semantic analysis, e.g. statistical dialog act prediction, require a repair processing step to resolve contradictions like the one in fig. 1.

We propose an algorithm for word lattices that divides repair detection and correction into three steps (cf. fig. 2). First, a trigger indicates potential IPs. Second, a stochastic model tries to find an appropriate repair for each IP by guessing the most probable segmentation. To accomplish this, repair processing is seen as a statistical machine translation problem where the reparandum is a translation of the reparans. For every repair found, a path representing the speaker's intended word sequence is inserted into the lattice. In the last step, a lattice parser selects the best path.

[Figure 2 shows the processing pipeline for the utterance "on Thursday I cannot no I can meet uh after one": speech recognizer; acoustic detection of the interruption point; local word-based scope detection of reparandum, editing term and reparans; lattice editing to represent the result, "on Thursday I can meet uh after one".]

Figure 2: An architecture for repair processing

2 Repair Triggers

Because it is impossible for a real-time speech system to check for every word whether it can be part of a repair, we use triggers which indicate the potential existence of a repair. These triggers must be immediately detectable for every word in the lattice. Currently we are using two different kinds of triggers^4:

1. Acoustic/prosodic cues: Speakers mark the IP in many cases by prosodic signals like pauses, hesitations, etc. A prosodic classifier^5 determines for every word the probability of an IP following. If it is above a certain threshold, the trigger becomes active. For a detailed description of the acoustic aspects see (Batliner et al., 1998).

2. Word fragments are a very strong repair indicator. Unfortunately, no speech recognizer is able to detect word fragments to date. But there are some interesting approaches to detecting words which are not in the recognizer's vocabulary (Klakow et al., 1999). A word fragment is normally an unknown word, and we hope that it can be distinguished from unfragmented unknown words by the prosodic classifier. So, currently this is a hypothetical trigger. We will elaborate on it in the evaluation section (cf. sect. 5) to show the impact of this trigger.

If a trigger is active, a search for an acceptable 
segmentation into reparandum, editing term 
and reparans is initiated. 
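The trigger step can be illustrated with a minimal sketch. It assumes that each word hypothesis carries an IP probability from the prosodic classifier and, for the hypothetical second trigger, a fragment flag. The names `WordHyp` and `active_triggers`, the threshold of 0.5 and all probabilities are invented for illustration and are not taken from the system described here.

```python
from dataclasses import dataclass

@dataclass
class WordHyp:
    word: str
    ip_prob: float             # assumed classifier output: P(IP follows this word)
    is_fragment: bool = False  # assumed output of a (hypothetical) fragment detector

def active_triggers(words, threshold=0.5):
    """Return the indices of words after which an IP is hypothesized."""
    return [i for i, w in enumerate(words)
            if w.ip_prob > threshold or w.is_fragment]

# Toy utterance with made-up IP probabilities.
utterance = [WordHyp("on", 0.02), WordHyp("Thursday", 0.10), WordHyp("I", 0.05),
             WordHyp("cannot", 0.81), WordHyp("no", 0.30), WordHyp("I", 0.04),
             WordHyp("can", 0.03), WordHyp("meet", 0.07)]
triggers = active_triggers(utterance)
```

Only the positions returned here would be passed on to the more expensive scope detection.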

3 Scope Detection 

As mentioned in the introduction, repair segmentation is based mainly on a stochastic translation model. Before we explain it in detail, we give a short introduction to statistical machine translation^6. The fundamental idea is the assumption that a given sentence $S$ in a source language (e.g. English) can be translated into any sentence $T$ in a target language (e.g. German). To every pair $(S, T)$ a probability is assigned which reflects the likelihood that a translator who sees $S$ will produce $T$ as the translation. The statistical machine translation problem is formulated as:

$$\hat{T} = \arg\max_{T} P(T \mid S) \tag{1}$$

^4 Other triggers can be added as well; (Stolcke et al., 1999), for example, integrate prosodic cues and an extended language model in a speech recognizer to detect IPs.

^5 The classifier is developed by the speech group of the IMMD 5. Special thanks to Anton Batliner, Richard Huber and Volker Warnke.

^6 A more detailed introduction is given by (Brown et al., 1990).



This is reformulated by Bayes' law for a better search space reduction, but we are only interested in the conditional probability $P(T \mid S)$. For further processing steps we have to introduce the concept of alignment (Brown et al., 1990). Let $S$ be the word sequence $S_1, S_2 \ldots S_l = S_1^l$ and $T = T_1, T_2 \ldots T_m = T_1^m$. We can link a word in $T$ to a word in $S$. This reflects the assumption that the word in $T$ is translated from the word in $S$. For example, if $S$ is "On Thursday" and $T$ is "Am Donnerstag", "Am" can be linked to "On" but also to "Thursday". If each word in $T$ is linked to exactly one word in $S$, these links can be described by a vector $a_1^m = a_1 \ldots a_m$ with $a_j \in \{0, \ldots, l\}$. If the word $T_j$ is linked to $S_i$ then $a_j = i$. If it is not connected to any word in $S$ then $a_j = 0$. Such a vector is called an alignment $a$. $P(T \mid S)$ can now be expressed by

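$$P(T \mid S) = \sum_{a \text{ is alignment}} P(T, a \mid S) \tag{2}$$

Without any further assumptions we can infer the following:

$$P(T, a \mid S) = P(m \mid S) \prod_{j=1}^{m} P(a_j \mid a_1^{j-1}, T_1^{j-1}, m, S) \, P(T_j \mid a_1^{j}, T_1^{j-1}, m, S) \tag{3}$$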

Now we return to self corrections. How can this framework help to detect the segments of a repair? Assume we have a lattice path where the reparandum ($RD$) and the reparans ($RS$) are given; then $(RS, RD)$ can be seen as a translation pair, and $P(RD \mid RS)$ can be expressed exactly the same way as in equation (2). Hence we have a method to score $(RS, RD)$ pairs. But the triggers only indicate the interruption point, not the complete segmentation. Let us first look at editing terms. We assume them to be a closed list of short phrases. Thus, if an entry of the editing term list is found after an IP, the corresponding words are skipped. Any subsequence of words before/after the IP could be the reparandum/reparans. Because turns can have an arbitrary length, it is impossible to compute $P(RD \mid RS)$ for every $(RS, RD)$ pair. But this is not necessary at all if repairs are considered as local phenomena. We restrict our search to a window of four words before and after the IP. A corpus analysis showed that 98% of all repairs are within this window. Now we only have to compute probabilities for $4^2$ different pairs. If the probability of a $(RS, RD)$ pair is above a certain threshold, the segmentation is accepted as a repair.
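The windowed search can be sketched as follows, assuming a flat word sequence instead of a lattice path. The editing term list and the function names are illustrative assumptions; the sketch skips at most one editing term and then enumerates every reparandum ending at the IP and every reparans starting after it, each up to four words long.

```python
# Illustrative closed list of editing terms (an assumption, not the system's list).
EDITING_TERMS = [("I", "mean"), ("well",), ("no",), ("uh",), ("uhm",)]
WINDOW = 4

def skip_editing_term(words, start):
    """Return the index of the first word after an editing term at `start`."""
    for term in sorted(EDITING_TERMS, key=len, reverse=True):
        if tuple(words[start:start + len(term)]) == term:
            return start + len(term)
    return start

def candidate_pairs(words, ip):
    """Enumerate (reparandum, reparans) pairs for an IP that follows words[ip]."""
    rs_start = skip_editing_term(words, ip + 1)
    pairs = []
    for rd_len in range(1, WINDOW + 1):
        rd_start = ip + 1 - rd_len
        if rd_start < 0:            # window runs past the start of the turn
            break
        for rs_len in range(1, WINDOW + 1):
            rs_end = rs_start + rs_len
            if rs_end > len(words):  # window runs past the end of the turn
                break
            pairs.append((words[rd_start:ip + 1], words[rs_start:rs_end]))
    return pairs

words = "on Thursday I cannot no I can meet uh after one".split()
pairs = candidate_pairs(words, ip=3)   # IP hypothesized after "cannot"
```

Each candidate pair would then be scored by the translation model, and only pairs above the threshold are kept.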

3.1 Parameter Estimation 

The conditional probabilities in equation (3) cannot be estimated reliably from any corpus of realistic size, because there are too many parameters. For example, both probabilities in the product depend on the complete reparans $RS$. Therefore we simplify the probabilities by assuming that $m$ depends only on $l$, $a_j$ only on $j$, $m$ and $l$, and finally $RD_j$ only on $RS_{a_j}$. So equation (3) becomes

$$P(RD, a \mid RS) = P(m \mid l) \prod_{j=1}^{m} P(a_j \mid j, m, l) \, P(RD_j \mid RS_{a_j}) \tag{4}$$
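As a worked illustration of equation (4), the sketch below scores a reparandum/reparans pair with toy distributions. All numbers and function names are invented. It also shows a standard property of models with this factorization: the sum over alignments in equation (2) can be computed position by position instead of enumerating all alignments. Whether the system sums over alignments or only uses the best one is not stated in the text, so this is an assumption of the sketch.

```python
from itertools import product
from math import prod

def joint_prob(rd, rs, a, p_len, p_align, p_word):
    """P(RD, a | RS) following equation (4); a[j] == 0 means 'not linked'."""
    m, l = len(rd), len(rs)
    padded = [None] + list(rs)      # index 0 stands for the empty link
    return p_len(m, l) * prod(
        p_align(a[j], j + 1, m, l) * p_word(rd[j], padded[a[j]])
        for j in range(m))

def marginal_prob(rd, rs, p_len, p_align, p_word):
    """Sum of joint_prob over all alignments, factorized per position."""
    m, l = len(rd), len(rs)
    padded = [None] + list(rs)
    return p_len(m, l) * prod(
        sum(p_align(i, j + 1, m, l) * p_word(rd[j], padded[i])
            for i in range(l + 1))
        for j in range(m))

# Toy distributions with made-up values.
def p_len(m, l):
    return 0.6 if m == l else 0.1

def p_align(i, j, m, l):
    return 0.7 if i == j else 0.3 / l

def p_word(rd_word, rs_word):
    if rs_word is None:
        return 0.01
    return 0.5 if rd_word == rs_word else 0.05

rd, rs = ["I", "cannot"], ["I", "can"]
brute_force = sum(joint_prob(rd, rs, a, p_len, p_align, p_word)
                  for a in product(range(len(rs) + 1), repeat=len(rd)))
factorized = marginal_prob(rd, rs, p_len, p_align, p_word)
```

With a window of four words the brute-force enumeration is still cheap, so the factorized form mainly serves as a cross-check here.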

These probabilities can be trained directly from a manually annotated corpus in which all repairs are labeled with begin, end, IP and editing term, and in which the words of each reparandum are linked to the corresponding words in the respective reparans. All distributions are smoothed by a simple back-off method (Katz, 1987) to avoid zero probabilities, with the exception that the word replacement probability $P(RD_j \mid RS_{a_j})$ is smoothed in a more sophisticated way.
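A minimal sketch of such training, assuming plain relative-frequency estimates and a made-up three-repair corpus: it counts length pairs for $P(m \mid l)$ and word links for the word replacement probability. The back-off smoothing and the alignment distribution are deliberately left out, and the data format is an assumption of this sketch.

```python
from collections import Counter

def train(repairs):
    """repairs: list of (rd_words, rs_words, alignment), alignment[j] in 0..len(rs)."""
    len_counts, len_totals = Counter(), Counter()
    word_counts, word_totals = Counter(), Counter()
    for rd, rs, alignment in repairs:
        len_counts[(len(rd), len(rs))] += 1
        len_totals[len(rs)] += 1
        for j, i in enumerate(alignment):
            if i > 0:                       # only linked words are counted
                word_counts[(rd[j], rs[i - 1])] += 1
                word_totals[rs[i - 1]] += 1
    # Unsmoothed relative frequencies.
    p_len = {k: c / len_totals[k[1]] for k, c in len_counts.items()}
    p_word = {k: c / word_totals[k[1]] for k, c in word_counts.items()}
    return p_len, p_word

# Made-up annotated repairs: (reparandum, reparans, links).
corpus = [(["I", "cannot"], ["I", "can"], [1, 2]),
          (["on", "Thursday"], ["on", "Friday"], [1, 2]),
          (["I"], ["I", "can"], [1])]
p_len, p_word = train(corpus)
```

Any pair missing from `p_word` would get probability zero here, which is exactly the problem the smoothing is meant to address.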

3.2 Smoothing 

Even if we reduce the number of parameters for the word replacement probability by the simplifications mentioned above, there are a lot of parameters left. With a vocabulary size of 2500 words, $2500^2$ parameters have to be estimated for $P(RD_j \mid RS_{a_j})$. The corpus^7 contains 3200 repairs, from which we extract about 5000 word links. So most of the possible word links never occur in the corpus. Some of them are more likely to occur in a repair than others. For example, the replacement of "Thursday" by "Friday" is supposed to be more likely than by "eating", even if both replacements are not in the training corpus. Of course, this is related to the fact that a repair is a syntactic and/or semantic anomaly. We make use of this by adding two additional knowledge sources to our model. Minimal syntactic information is given by part-of-speech (POS) tags and POS sequences; semantic information is given by semantic word classes. Hence the input is not merely a sequence of words but a sequence of triples. Each triple has three slots (word, POS tag, semantic class). In the next section we will describe how we obtain these two pieces of information for every word in the lattice. With this additional information, the probability $P(RD_j \mid RS_{a_j})$ can be smoothed by linear interpolation of word, POS and semantic class replacement probabilities:

$$\begin{aligned} P(RD_j \mid RS_{a_j}) = {} & \alpha \cdot P(Word(RD_j) \mid Word(RS_{a_j})) \\ & + \beta \cdot P(SemClass(RD_j) \mid SemClass(RS_{a_j})) \\ & + \gamma \cdot P(POS(RD_j) \mid POS(RS_{a_j})) \end{aligned} \tag{5}$$

with $\alpha + \beta + \gamma = 1$.

$Word(RD_j)$ is the notation for the selector of the word slot of the triple at position $j$.

^7 11000 turns with ~240000 words
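The interpolation in equation (5) can be sketched in a few lines. The weights, tag names, semantic classes and probability tables below are invented for illustration. The example reproduces the "Thursday" case from the text: neither replacement has been seen at the word level, yet the class-level terms still separate the two candidates.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["word", "pos", "sem"])

def interpolated_prob(rd, rs, p_word, p_sem, p_pos,
                      alpha=0.5, beta=0.3, gamma=0.2):
    """Linear interpolation as in equation (5); the weights are assumptions."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return (alpha * p_word.get((rd.word, rs.word), 0.0)
            + beta * p_sem.get((rd.sem, rs.sem), 0.0)
            + gamma * p_pos.get((rd.pos, rs.pos), 0.0))

# Made-up tables, keyed by (reparandum value, reparans value).
p_word = {}                                    # both word pairs are unseen
p_sem = {("weekday", "weekday"): 0.4}
p_pos = {("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.01}

thursday = Triple("Thursday", "NOUN", "weekday")
friday = Triple("Friday", "NOUN", "weekday")
eating = Triple("eating", "VERB", "activity")

score_friday = interpolated_prob(thursday, friday, p_word, p_sem, p_pos)
score_eating = interpolated_prob(thursday, eating, p_word, p_sem, p_pos)
```

In practice the weights would be tuned on held-out data rather than fixed by hand as they are here.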

4 Integration with Lattice Processing

We can now detect and correct a repair, given a 
sentence annotated with POS tags and seman- 
tic classes. But how can we construct such a 
sequence from a word lattice? Integrating the 
model in a lattice algorithm requires three steps: 

• mapping the word lattice to a tag lattice 

• triggering IPs and extracting the possible 
reparandum/reparans pairs 

• introducing new paths to represent the 
plausible reparans 

The tag lattice construction is adapted from (Samuelsson, 1997). For every word edge and every denoted POS tag a corresponding tag edge is created and the resulting probability is determined. If a tag edge already exists, the probabilities of both edges are merged. The original words are stored together with their unique semantic class in an associated list. Paths through the tag graph are scored by a POS trigram. If a trigger is active, all paths through the word before the IP need to be tested for whether an acceptable repair segmentation exists. Since the scope model takes at most four words for reparandum and reparans into account, it is sufficient to expand only partial paths. Each of these partial paths is then processed by the scope model. To reduce the search space, paths with a low score can be pruned.
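The mapping from word edges to tag edges might look like the following sketch. The text does not say how the probabilities of coinciding tag edges are merged; summation is used here purely as an assumption. The lexicon, the tag names and the semantic classes are also invented.

```python
from collections import defaultdict

def build_tag_lattice(word_edges, lexicon, sem_class):
    """word_edges: (start, end, word, prob); lexicon: word -> {tag: P(tag | word)}."""
    tag_edges = defaultdict(lambda: {"prob": 0.0, "words": []})
    for start, end, word, prob in word_edges:
        for tag, p_tag in lexicon[word].items():
            edge = tag_edges[(start, end, tag)]
            edge["prob"] += prob * p_tag     # assumption: merge by summing
            # Keep the original word and its semantic class with the tag edge.
            edge["words"].append((word, sem_class[word]))
    return dict(tag_edges)

# Two competing word hypotheses between the same pair of nodes (toy values).
word_edges = [(0, 1, "can", 0.6), (0, 1, "cannot", 0.4)]
lexicon = {"can": {"VMOD": 0.9, "NOUN": 0.1}, "cannot": {"VMOD": 1.0}}
sem_class = {"can": "modal", "cannot": "modal"}
tag_lattice = build_tag_lattice(word_edges, lexicon, sem_class)
```

Both word hypotheses collapse into one `VMOD` edge, which keeps the tag lattice smaller than the word lattice while the word list preserves what is needed for the scope model.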

Repair processing is integrated into the Verbmobil system as a filter process between speech recognition and syntactic analysis. This enforces a repair representation that can be integrated into a lattice. It is not possible to mark only the words with some additional information, because a repair is a phenomenon that depends on a path. Imagine that the system has detected a repair on a certain path in the lattice and marked all words by their repair function. Then a search process (e.g. the parser) selects a different path which shares only the words of the reparandum. But these words are not a reparandum on this path. A solution is to introduce a new path in the lattice where reparandum and editing terms are deleted. As we said before, we do not want to delete these segments, so they are stored in a special slot of the first word of the reparans. The original path can now be reconstructed if necessary.
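One way to picture this representation is the sketch below. It assumes a lattice stored as a list of edges between numbered nodes, and the `Edge` class with its `deleted` slot is a hypothetical data structure. The new edge bypasses the reparandum and editing term, carries the first word of the reparans, and keeps the skipped edges so that the original path can be recovered.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    start: int     # start node
    end: int       # end node
    word: str
    score: float   # acoustic score (toy value below)
    deleted: list = field(default_factory=list)  # hypothetical "special slot"

def add_repair_edge(lattice, path, rd_start, rs_start):
    """Add an edge that skips path[rd_start:rs_start] and leave the old path intact."""
    first_rs = path[rs_start]
    new_edge = Edge(start=path[rd_start].start, end=first_rs.end,
                    word=first_rs.word, score=first_rs.score,
                    deleted=list(path[rd_start:rs_start]))
    lattice.append(new_edge)
    return new_edge

def original_words(edge):
    """Recover the word sequence that the repair edge replaces."""
    return [e.word for e in edge.deleted] + [edge.word]

words = "on Thursday I cannot no I can meet".split()
path = [Edge(i, i + 1, w, -1.0) for i, w in enumerate(words)]
lattice = list(path)
repair = add_repair_edge(lattice, path, rd_start=2, rs_start=5)
```

Because the original edges stay in the lattice, a later search process remains free to reject the repair by choosing the unmodified path.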

To ensure that these new paths are comparable to other paths, we score the reparandum the same way the parser does and add the resulting value to the first word of the reparans. As a result, both the original path and the one with the repair get the same score except for one word transition. The (probably bad) transition in the original path from the last word of the reparandum to the first word of the reparans is replaced by a (probably good) transition from the reparandum's onset to the reparans. We take the lattice in fig. 2 to give an example. The scope model has marked "I cannot" as the reparandum, "no" as an editing term, and "I can" as the reparans. We sum up the acoustic scores of "I", "cannot" and "no". Then we add the maximum language model scores for the transition to "I", to "cannot" given "I", and to "no" given "I" and "cannot". This score is added as an offset to the acoustic score of the second "I".
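The offset computation can be sketched under a few assumptions: scores are additive log scores, the language model is a trigram table, and "maximum" means the best score over all contexts that are compatible with the known part of the history. The table and all values are invented, and the interpretation of the maximum is a guess at the intended procedure.

```python
def max_lm_score(lm, word, known_history):
    """Best trigram log score for `word` over all contexts ending in known_history."""
    n = len(known_history)
    candidates = [score for (history, w), score in lm.items()
                  if w == word and (n == 0 or history[-n:] == tuple(known_history))]
    return max(candidates)

def repair_offset(deleted, lm):
    """deleted: (word, acoustic log score) for reparandum plus editing term."""
    offset = sum(score for _, score in deleted)       # acoustic part
    words = [w for w, _ in deleted]
    for k, word in enumerate(words):
        # Only words inside the deleted segment are known history.
        offset += max_lm_score(lm, word, words[max(0, k - 2):k])
    return offset

# Toy trigram table: ((w1, w2), w3) -> log score.
lm = {(("on", "Thursday"), "I"): -1.0,
      (("<s>", "and"), "I"): -0.5,
      (("Thursday", "I"), "cannot"): -2.0,
      (("I", "cannot"), "no"): -3.0}
deleted = [("I", -4.0), ("cannot", -6.0), ("no", -5.0)]
offset = repair_offset(deleted, lm)
```

The resulting value would be added to the acoustic score of the first reparans word, so that the repaired path and the original path differ only in the single transition discussed above.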

5 Results and Further Work 

Due to the different trigger situations we performed two tests: one where we use only acoustic triggers (Test 1) and another where the existence of a perfect word fragment detector is assumed (Test 2). The input consisted of unsegmented transliterated utterances, to exclude the influence of a word recognizer. We restricted the processing time on a SUN/ULTRA 300MHz to 10 seconds. The parser was simulated by a word trigram. Training and testing were done on two separate parts of the German part of the Verbmobil corpus (12558 turns training / 1737 turns test).





          Detection             Correct scope
          Recall   Precision    Recall   Precision
Test 1    49%      70%          47%      70%
Test 2    71%      85%          62%      83%

A direct comparison to other groups is rather difficult due to very different corpora, evaluation conditions and goals. (Nakatani and Hirschberg, 1993) suggest an acoustic/prosodic detector to identify IPs but do not discuss the problem of finding the correct segmentation in depth. Also, their results are obtained on a corpus where every utterance contains at least one repair. (Shriberg, 1994) also addresses the acoustic aspects of repairs. Parsing approaches like those in (Bear et al., 1992; Hindle, 1983; Core and Schubert, 1999) must be proved to work with lattices rather than transliterated text. An algorithm which is inherently capable of lattice processing is proposed by Heeman (Heeman, 1997). He redefines the word recognition problem to identify the best sequence of words, corresponding POS tags and special repair tags. He reports a recall rate of 81% and a precision of 83% for detection and 78%/80% for correction. The test settings are nearly the same as in test 2. Unfortunately, nothing is said about the processing time of his module.

We have presented an approach to score potential reparandum/reparans pairs with a relatively simple scope model. Our results show that repair processing with statistical methods and without deep syntactic knowledge is a promising approach, at least for modification repairs. Within this framework more sophisticated scope models can be evaluated. A system integration as a filter process is described. Mapping the word lattice to a POS tag lattice is not optimal, because word information is lost in the search for partial paths. We plan to implement a combined POS/word tagger.